P10218 

Automatic Speech Recognition System 

FIELD OF THE INVENTION 

The present invention relates to an automatic speech recognition system
and, more particularly, to an automatic speech recognition system which is
able to recognize speech with high accuracy while a speaker and a moving
object having the automatic speech recognition system are moving around.

BACKGROUND OF THE INVENTION 

Speech recognition techniques have recently matured to the point of
practical use and have begun to be applied to areas such as the input of
information in the form of speech. Research and development of robots has
also been flourishing, and speech recognition plays a key role in putting a
robot to practical use. This is because intelligent social interaction
between a robot and a human requires the former to understand human
language, which increases the importance of the accuracy achieved in speech
recognition.

Communicating with a speaker in an actual environment poses several
problems that do not arise in speech recognition carried out in a laboratory,
in which speech is input through a microphone placed near the mouth of the
speaker.

For example, since there are various types of noise in an actual
environment, speech recognition cannot succeed unless the necessary speech
signals are separated from the noise. When there is a plurality of speakers,
it is necessary to extract the speech of the specified speaker
to be recognized. A Hidden Markov Model (HMM) is generally used for speech
recognition. This model has the problem that the recognition rate is
adversely affected because the voice of a speaker sounds different
according to the position of the speaker (relative to a microphone of an
automatic speech recognition system).

A research group including the inventors of the present invention 
disclosed a technique that performs localization, separation and recognition of 
a plurality of sound sources by active audition (see non-patent document 1).

This technique, which has two microphones provided at positions 
corresponding to ears of a human, enables recognition of words uttered by one 
speaker when a plurality of speakers simultaneously utter words. More 
specifically speaking, the technique localizes the speakers based on acoustic 
signals entered through the two microphones and separates speeches for each 
speaker so as to recognize them. In this recognition, acoustic models are 
generated beforehand, which are adjusted to directions covering a range of 
-90° to 90° at intervals of 10° as viewed from a moving object (such as a 
robot having an automatic speech recognition system). When speech 
recognition is performed, processes with these acoustic models are carried out 
in parallel. 

Non-patent document 1: Kazuhiro Nakadai, et al., "A Humanoid Listens to Three
Simultaneous Talkers by Integrating Active Audition and Face Recognition",
IJCAI-03 Workshop on Issues in Designing Physical Agents for Dynamic
Real-Time Environments: World Modeling, Planning, Learning and
Communicating, pp. 117-124

SUMMARY OF THE INVENTION

The conventional technique described above has posed a problem that 
because a position of the speaker changes with respect to the moving object 
each time the speaker and the moving object relatively move, a recognition 
rate decreases if the speaker stands at a position, for which an acoustic model 
is not prepared in advance. 

The present invention, which is created in view of the background
described above, provides an automatic speech recognition system which is
able to recognize speech with high accuracy while a speaker and a moving
object are moving around.

It is an aspect of the present invention to provide an automatic speech 
recognition system, which recognizes speeches in acoustic signals detected by a 
plurality of microphones as character information. The system comprises a 
sound source localization module, a feature extractor, an acoustic model 
memory, an acoustic model composition module and a speech recognition 
module. The sound source localization module localizes a sound direction 
corresponding to a specified speaker based on the acoustic signals detected by 
the plurality of microphones. The feature extractor extracts features of speech 
signals contained in one or more pieces of information detected by the plurality 
of microphones. The acoustic model memory stores direction-dependent 
acoustic models that are adjusted to a plurality of directions at intervals. The 
acoustic model composition module composes an acoustic model adjusted to the 
sound direction, which is localized by the sound source localization module, 
based on the direction-dependent acoustic models in the acoustic model 
memory. The acoustic model composition module also stores the acoustic model 
in the acoustic model memory. The speech recognition module recognizes the
features extracted by the feature extractor as character information using the
acoustic model composed by the acoustic model composition module. 

In the automatic speech recognition system described above, the sound 
source localization module localizes a sound direction, the acoustic model 
composition module composes an acoustic model adjusted to a direction based 
on the sound direction and direction-dependent acoustic models and the speech 
recognition module performs speech recognition with the acoustic model. 

It may be preferable, but not necessarily, that the automatic speech
recognition system includes a sound source separation module which
separates the speech signals of the specified speaker from the acoustic signals,
and that the feature extractor extracts the features of the speech signals based
on the speech signals separated by the sound source separation module.

In the automatic speech recognition system described above, the sound 
source localization module localizes the sound direction and the sound source 
separation module separates only the speeches corresponding to the sound 
direction localized by the sound source localization module. The acoustic model 
composition module composes the acoustic model corresponding to the sound 
direction based on the sound direction and the direction-dependent acoustic 
models. The speech recognition module carries out speech recognition with this 
acoustic model. 

In this connection, the speech signals delivered by the sound source 
separation module are not limited to analogue speech signals, but they may 
include any type of information as long as it is meaningful in terms of speech, 
such as digitized signals, coded signals and spectrum data obtained by 
frequency analysis. 

It may be possible that the sound source localization module is
configured to execute a process comprising: performing a frequency analysis
for the acoustic signals detected by the microphones to extract harmonic
relationships; acquiring an intensity difference and a phase difference for the
harmonic relationships extracted through the plurality of microphones;
acquiring belief factors for a sound direction based on the intensity difference
and the phase difference, respectively; and determining a most probable sound
direction.

It may be possible that the sound source localization module employs
scattering theory that generates a model for an acoustic signal, which scatters
on a surface of a member, such as a head of a robot, to which the microphones
are attached, according to a sound direction so as to specify the sound direction
for the speaker with the intensity difference and the phase difference detected
through the plurality of microphones.

It may be preferable, but not necessarily, that the sound source
separation module employs an active direction-pass filter so as to separate
speeches, the filter being configured to execute a process comprising:
separating speeches by a narrower directional band when a sound direction,
which is localized by the sound source localization module, lies close to a front,
which is defined by an arrangement of the plurality of microphones; and
separating speeches by a wider directional band when the sound direction lies
apart from the front.

It may be preferable, but not necessarily, that the acoustic model
composition module is configured to compose an acoustic model for the sound
direction by applying weighted linear summation to the direction-dependent
acoustic models in the acoustic model memory, and that weights introduced into
the linear summation are determined by training.

It may be preferable, but not necessarily, that the automatic speech
recognition system further comprises a speaker identification module, the
acoustic model memory possesses direction-dependent acoustic models for
respective speakers, and the acoustic model composition module is configured
to execute a process comprising: referring to direction-dependent acoustic
models of a speaker who is identified by the speaker identification module and to
a sound direction localized by the sound source localization module; composing
an acoustic model for the sound direction based on the direction-dependent
acoustic models in the acoustic model memory; and storing the acoustic model
in the acoustic model memory.

It may be preferable, but not necessarily, that the automatic speech
recognition system further comprises a masking module. The masking module
compares patterns prepared in advance with the
features extracted by the feature extractor or the speech signals separated by
the sound source separation module so as to identify a domain (a frequency
domain or a sub-band, for example) in which a difference with respect to the
patterns is greater than a predetermined threshold. The masking module
sends the speech recognition module an index indicating that reliability in
terms of feature is low for the identified domain.

It is another aspect of the present invention to provide an automatic 
speech recognition system, which recognizes speeches in acoustic signals 
detected by a plurality of microphones as character information. The system 
comprises a sound source localization module, a stream tracking module, a 
sound source separation module, a feature extractor, an acoustic model 
memory, an acoustic model composition module and a speech recognition 
module. The sound source localization module localizes a sound direction
corresponding to a specified speaker based on the acoustic signals detected by
the plurality of microphones. The stream tracking module stores the sound 
direction localized by the sound source localization module so as to estimate a 
direction in which the specified speaker is moving. Also the stream tracking 
module estimates a current position of the speaker according to the estimated 
direction. The sound source separation module separates speech signals of the 
specified speaker from the acoustic signals based on a sound direction, which 
is determined by the current position of the speaker estimated by the stream 
tracking module. The feature extractor extracts features of the speech signals 
separated by the sound source separation module. The acoustic model memory 
stores direction-dependent acoustic models that are adjusted to a plurality of 
directions at intervals. The acoustic model composition module composes an 
acoustic model adjusted to the sound direction, which is localized by the sound 
source localization module, based on the direction-dependent acoustic models 
in the acoustic model memory. Also the acoustic model composition module 
stores the acoustic model in the acoustic model memory. The speech 
recognition module recognizes the features extracted by the feature extractor 
as character information using the acoustic model, which is composed by the 
acoustic model composition module. 

The automatic speech recognition system described above, which 
identifies the sound direction of the speech signals generated in an arbitrary 
direction and carries out speech recognition using the acoustic model 
appropriate for the sound direction, is able to increase speech recognition rate. 



BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram showing an automatic speech recognition
system according to an embodiment of the present invention.

FIG. 2 is a block diagram showing an example of a sound source 
localization module. 

FIG.3 is a schematic diagram illustrating operation of a sound source 
localization module. 

FIG.4 is a schematic diagram illustrating operation of a sound source 
localization module. 

FIG. 5 is a schematic diagram describing auditory epipolar geometry. 

FIG. 6 is a graph showing the relationship between phase difference
and frequency $f$.

FIG.7A and FIG.7B are graphs each showing an example of a head 
related transfer function. 

FIG. 8 is a block diagram showing an example of a sound source 
separation module. 

FIG. 9 is a graph showing an example of a pass range function. 

FIG. 10 is a schematic diagram illustrating operation of a subband 
selector. 

FIG. 11 is a plan view showing an example of a pass range. 
FIG.12A and FIG.12B are block diagrams each showing an example of a 
feature extractor. 

FIG. 13 is a block diagram showing an example of an acoustic model 
composition module. 

FIG. 14 is a table showing a unit for recognition and a sub-model of a 
direction-dependent acoustic model. 

FIG. 15 is a schematic diagram illustrating operation of a parameter
composition module.

FIG. 16A and FIG. 16B are graphs each showing an example of a weight
$W_n$.

FIG. 17 is a table showing a training method of a weight $W$.

FIG. 18 is a block diagram showing an automatic speech recognition
system according to another embodiment of the present invention.

FIG. 19 is a schematic diagram illustrating a difference in input distance 
of an acoustic signal. 

FIG. 20 is a block diagram showing an automatic speech recognition
system according to another embodiment of the present invention.

FIG. 21 is a block diagram showing a stream tracking module.

FIG.22 is a graph showing a sound direction history. 

DESCRIPTION OF THE PREFERRED EMBODIMENTS 
[First Embodiment] 

Detailed description is given of an embodiment of the present invention
with reference to the appended drawings. FIG. 1 is a block diagram showing an
automatic speech recognition system according to a first embodiment of the
present invention.

As shown in FIG. 1, an automatic speech recognition system 1 according
to the first embodiment includes two microphones Mr and Ml, a sound source
localization module 10, a sound source separation module 20, an acoustic
model memory 49, an acoustic model composition module 40, a feature
extractor 30 and a speech recognition module 50. The module 10 localizes a
speaker (sound source) on receiving acoustic signals detected by the microphones
Mr and Ml. The module 20 separates acoustic signals originating from a sound
source at a particular direction based on the direction of the sound source
localized by the module 10 and spectrums obtained by the module 10. The
module 49 stores acoustic models adjusted to a plurality of directions. The
module 40 composes an acoustic model adjusted to a sound direction, based on
the sound direction which is localized by the module 10 and the acoustic
models stored in the module 49. The module 30 extracts features of acoustic
signals based on a spectrum of the specified sound source, which is separated
by the module 20. The module 50 performs speech recognition based on the
acoustic model composed by the module 40 and the features of the acoustic
signals extracted by the module 30. Among these modules, the module 20 is
not mandatory and may be adopted as needed.

The invention, in which the module 50 performs speech recognition with 
the acoustic model that is composed and adjusted to the sound direction by the 
module 40, is able to provide a high recognition rate. 

Next, description is given of the microphones Mr and Ml, the sound 
source localization module 10, the sound source separation module 20, the 
feature extractor 30, the acoustic model composition module 40 and the speech 
recognition module 50, respectively. 
(Microphones Mr and Ml) 

The microphones Mr and Ml are each a typical type of microphone, 
which detects sounds and generates electric signals (acoustic signals). The 
number of microphones is not limited to two as is exemplarily shown in this 
embodiment, but it is possible to select any number, for example three or four, 
as long as it is plural. The microphones Mr and Ml are, for example, installed
in the ears of a robot RB, a moving object.

A typical front of the automatic speech recognition system 1 in terms of
collecting acoustic signals is defined by an arrangement of the microphones Mr
and Ml. Mathematically, the direction of the sum of the vectors, each oriented
in the direction in which one of the microphones Mr and Ml collects sound,
coincides with the front of the automatic speech recognition system 1. As
shown in FIG. 1, when the microphones Mr and Ml are installed on left
and right sides of a head of the robot RB, a front of the robot RB will coincide
with the front of the automatic speech recognition system 1.
(Sound source localization module 10) 

FIG.2 is a block diagram showing an example of a sound source 
localization module. FIG.3 and FIG.4 are schematic diagrams each describing 
operation of a sound source localization module. 

The sound source localization module 10 localizes a direction of sound 
source for each of speakers HMj (HM1 and HM2 in FIG.3, for example) based 
on two kinds of acoustic signals received from the two microphones Mr and Ml. 
There are several methods for localizing a sound source, such as a method
utilizing a phase difference between acoustic signals entering the microphones
Mr and Ml, a method of estimation with a head related transfer function of the
robot RB, and a method of establishing a correlation between signals entering
through the right and left microphones Mr and Ml. Each of the methods
described above has been improved in various ways so as to increase accuracy.
Description is given here, as an example, of a method with which the inventors
of the present invention have succeeded in attaining improvement.

As shown in FIG.2, the sound source localization module 10 includes a 
frequency analysis module 11, a peak extractor 12, a harmonic relationship 
extractor 13, an IPD calculator 14, an IID calculator 15, a hypothesis 16 by 
auditory epipolar geometry, a belief factor calculator 17 and a belief factor 
integrator 18.

Each of these portions will be described with reference to FIG. 3 and
FIG. 4. A situation where the speakers HM1 and HM2 simultaneously start
speaking to the robot RB is assumed in the following description.
(Frequency analysis module 11) 

The frequency analysis module 11 cuts out a signal section having a
microscopic time length $\Delta t$ from right and left acoustic signals CR1 and CL1,
which are detected by the right and left microphones Mr and Ml installed in
the robot RB, and performs a frequency analysis for each of the left and right
channels with a Fast Fourier Transform (FFT).

Results obtained from the acoustic signals CR1, which are received from 
the right microphone Mr, are designated as a spectrum CR2. Similarly, results 
obtained from the acoustic signals CL1, which are received from the left 
microphone Ml, are designated as a spectrum CL2. 

It may be alternatively possible to adopt other methods for frequency 
analysis, such as a band pass filter. 
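
As an illustration of this step, the following minimal sketch cuts a short section out of each channel and computes its spectrum. It is not the implementation of the embodiment: the names `dft` and `analyze_frame`, the frame length and the test tone are assumptions introduced here, and a plain discrete Fourier transform stands in for the FFT so that only the Python standard library is needed.

```python
import cmath
import math


def dft(frame):
    """Discrete Fourier transform of a real frame (O(N^2); an FFT yields the same values)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def analyze_frame(right, left, start, length):
    """Cut the same short section out of both channels and transform each one."""
    spectrum_right = dft(right[start:start + length])
    spectrum_left = dft(left[start:start + length])
    return spectrum_right, spectrum_left


# Hypothetical input: a 1 kHz tone sampled at 8 kHz, weaker on the left channel.
rate, length = 8000, 64
right = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)]
left = [0.5 * x for x in right]
cr2, cl2 = analyze_frame(right, left, 0, length)
peak_bin = max(range(len(cr2)), key=lambda k: abs(cr2[k]))
print(peak_bin * rate / length)  # frequency of the strongest sub-band in Hz
```

The two returned lists play the role of the spectrums CR2 and CL2; each entry is a complex sub-band value whose real and imaginary parts are used by the later stages.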
(Peak extractor 12) 

The peak extractor 12 extracts consecutive peaks from the spectrums
CR2 and CL2 for the right and left channels, respectively. One method is to
directly extract local peaks of a spectrum. The other is a method
based on spectral subtraction (see S. F. Boll, "A spectral subtraction
algorithm for suppression of acoustic noise in speech", Proceedings of 1979
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP-79)). The latter method extracts peaks from a spectrum and
subtracts the extracted peaks from the spectrum, generating a residual
spectrum. The process for extracting peaks is repeated until no peaks are
found in the residual spectrum.






When extraction of peaks is carried out for the spectrums CR2 and CL2, 
only sub-band signals forming peaks such as peak spectrums CR3 and CL3 are 
extracted. 
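
The first of the two methods, direct extraction of local peaks, can be sketched as follows. This is a minimal illustration under assumptions made here: the function names, the magnitude threshold and the sample values are hypothetical and do not come from the embodiment.

```python
def extract_local_peaks(magnitudes, threshold=0.0):
    """Return indices of sub-bands that exceed the threshold and both neighbours."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        if (magnitudes[k] > threshold
                and magnitudes[k] > magnitudes[k - 1]
                and magnitudes[k] > magnitudes[k + 1]):
            peaks.append(k)
    return peaks


def peak_spectrum(spectrum, peaks):
    """Keep only the sub-bands that form peaks and zero everything else."""
    keep = set(peaks)
    return [value if k in keep else 0.0 for k, value in enumerate(spectrum)]


# Hypothetical magnitude spectrum with three clear peaks.
magnitudes = [0.1, 0.9, 0.2, 0.1, 0.7, 0.3, 0.05, 0.6, 0.1]
peaks = extract_local_peaks(magnitudes, threshold=0.5)
print(peaks)
print(peak_spectrum(magnitudes, peaks))
```

The second list corresponds to a peak spectrum such as CR3 or CL3, in which only the sub-bands forming peaks survive.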

(Harmonic relationship extractor 13) 

The harmonic relationship extractor 13 generates a group, which 
contains peaks having a particular harmonic relationship, for each of the right 
and left channels, according to harmonic relationship which a sound source 
possesses. Taking a human voice, for example, a voice of a specified person is 
composed of sounds having fundamental frequencies and their harmonics. 
Because fundamental frequencies slightly differ from person to person, it is 
possible to categorize voices of a plurality of persons into groups according to 
difference in the frequencies. The peaks, which are categorized into a group 
according to harmonic relationship, can be estimated as signals generated by a 
common sound source. If a plural number (J) of speakers is simultaneously 
speaking, for example, the same plural number (J) of harmonic relationships is 
extracted. 

In FIG. 3, peaks P1, P3 and P5 of the peak spectrum CR3 are categorized
into one group of harmonic relationship CR41. Peaks P2, P4 and P6 of the
peak spectrum CR3 are categorized into one group of harmonic relationship
CR42. Similarly, peaks P1, P3 and P5 of the peak spectrum CL3 are
categorized into one group of harmonic relationship CL41. Peaks P2, P4 and
P6 of the peak spectrum CL3 are also categorized into one group of harmonic
relationship CL42.
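
One simple way to form such groups is sketched below. The greedy rule used here (take the lowest ungrouped peak as a fundamental and collect every peak lying near an integer multiple of it) is an assumption chosen for illustration, as are the function name, the tolerance and the frequencies; the embodiment does not specify the grouping procedure at this level of detail.

```python
def group_harmonics(peak_freqs, tolerance=0.05):
    """Greedily group peaks whose frequencies are near-integer multiples of a fundamental."""
    remaining = sorted(peak_freqs)
    groups = []
    while remaining:
        fundamental = remaining[0]  # lowest ungrouped peak, assumed to be a fundamental
        group, rest = [], []
        for f in remaining:
            ratio = f / fundamental
            if abs(ratio - round(ratio)) <= tolerance:
                group.append(f)
            else:
                rest.append(f)
        groups.append(group)
        remaining = rest
    return groups


# Hypothetical peaks of two simultaneous speakers (fundamentals 100 Hz and 130 Hz).
groups = group_harmonics([100.0, 130.0, 200.0, 260.0, 300.0, 390.0])
print(groups)
```

With these inputs the interleaved peaks separate into two groups, mirroring how P1, P3 and P5 fall into one harmonic relationship and P2, P4 and P6 into another.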
(IPD calculator 14) 

The IPD calculator 14 calculates an interaural phase difference (IPD)
from spectrums of the harmonic relationships CR41, CR42, CL41 and CL42.

Let us suppose that a set of peak frequencies included in a harmonic
relationship (the harmonic relationship CR41, for example) corresponding to a
speaker HMj is $\{f_k \mid k = 0 \ldots K-1\}$. The IPD calculator 14 selects a spectral
sub-band corresponding to each $f_k$ from both right and left channels
(harmonic relationships CR41 and CL41, for example), calculating an IPD
$\Delta\varphi(f_k)$ with an equation (1). The IPD $\Delta\varphi(f_k)$ calculated from the harmonic
relationships CR41 and CL41 results in an interaural phase difference C51, as
shown in FIG. 4. Here, $\Delta\varphi(f_k)$ is an IPD for a harmonic component $f_k$ lying
in a harmonic relationship and $K$ represents the number of harmonics lying in
this harmonic relationship.

$$\Delta\varphi(f_k) = \arctan\left(\frac{\Im[S_r(f_k)]}{\Re[S_r(f_k)]}\right) - \arctan\left(\frac{\Im[S_l(f_k)]}{\Re[S_l(f_k)]}\right) \tag{1}$$

where:

$\Delta\varphi(f_k)$ : IPD (interaural phase difference) for $f_k$
$\Im[S_r(f_k)]$ : an imaginary part of spectrum for a peak $f_k$ of right input signal
$\Re[S_r(f_k)]$ : a real part of spectrum for a peak $f_k$ of right input signal
$\Im[S_l(f_k)]$ : an imaginary part of spectrum for a peak $f_k$ of left input signal
$\Re[S_l(f_k)]$ : a real part of spectrum for a peak $f_k$ of left input signal
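
The calculation of an IPD for each peak can be sketched as follows, reading equation (1) as the phase of the right sub-band minus the phase of the left sub-band. The function name and the sub-band values are hypothetical. The sketch uses `math.atan2` rather than an arctangent of the ratio of imaginary to real part; that substitution is an assumption made here to avoid division by zero, and it returns phases over the full circle rather than over a half circle. No phase unwrapping is attempted.

```python
import cmath
import math


def ipd(s_right, s_left):
    """IPD for one peak: phase of the right sub-band minus phase of the left one."""
    phase_right = math.atan2(s_right.imag, s_right.real)
    phase_left = math.atan2(s_left.imag, s_left.real)
    return phase_right - phase_left


# Hypothetical complex sub-band values for three harmonics of one speaker.
right_peaks = [cmath.rect(1.0, 0.50), cmath.rect(0.8, 1.00), cmath.rect(0.5, 1.50)]
left_peaks = [cmath.rect(0.9, 0.30), cmath.rect(0.7, 0.60), cmath.rect(0.4, 0.90)]
ipds = [ipd(r, l) for r, l in zip(right_peaks, left_peaks)]
print([round(x, 2) for x in ipds])
```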
(IID calculator 15) 

The IID calculator 15 calculates a difference in sound pressure between
sounds received from the right and left microphones Mr and Ml (interaural
intensity difference) for a harmonic belonging to a harmonic relationship.

The IID calculator 15 selects a spectral sub-band, which corresponds to a
harmonic having a peak frequency $f_k$ lying in a harmonic relationship of a
speaker HMj, from both right and left channels (harmonic relationships CR41
and CL41, for example), calculating an IID $\Delta\rho(f_k)$ with an equation (2).
The IID $\Delta\rho(f_k)$ calculated from the harmonic relationships CR41 and CL41
results in an interaural intensity difference C61 as shown in FIG. 4, for example.

$$\Delta\rho(f_k) = p_r(f_k) - p_l(f_k) \tag{2}$$

where:

$\Delta\rho(f_k)$ : IID (interaural intensity difference) for $f_k$
$p_r(f_k)$ : power for peak $f_k$ of a right input signal
$p_l(f_k)$ : power for peak $f_k$ of a left input signal

$$p_r(f_k) = 10 \log_{10}\left(\Im[S_r(f_k)]^2 + \Re[S_r(f_k)]^2\right)$$

$$p_l(f_k) = 10 \log_{10}\left(\Im[S_l(f_k)]^2 + \Re[S_l(f_k)]^2\right)$$
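
A minimal sketch of the IID calculation follows. It assumes that the intensity difference is the right-channel power minus the left-channel power, both in decibels as defined above; the function names and the sample values are hypothetical.

```python
import math


def power_db(s):
    """Power of one sub-band in decibels: 10*log10(Im^2 + Re^2)."""
    return 10.0 * math.log10(s.imag ** 2 + s.real ** 2)


def iid(s_right, s_left):
    """IID for one peak, taken here as right power minus left power (an assumption)."""
    return power_db(s_right) - power_db(s_left)


# Hypothetical sub-band values: the right channel has twice the amplitude of the left.
print(iid(complex(1.0, 0.0), complex(0.5, 0.0)))
```

A sub-band with zero power would make the logarithm undefined, so a practical version would need a floor on the power; none is applied in this sketch.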

(Hypothesis 16 by auditory epipolar geometry) 

Refer to FIG. 5, in which a head portion of the robot RB, which is
modeled by a sphere, is viewed from above. The hypothesis 16 by auditory
epipolar geometry represents data of phase difference, which is estimated
based on a time difference resulting from a difference in distance with respect
to a sound source S between the microphones Mr and Ml, which are installed
in both ears of the robot RB.

According to auditory epipolar geometry, a phase difference $\Delta\varphi$ is
obtained with an equation (3). It is assumed here that the sphere is
representative of the shape of the head.

$$\Delta\varphi = \frac{2\pi f}{v} \times r(\theta + \sin\theta) \tag{3}$$

where $\Delta\varphi$ represents an interaural phase difference (IPD), $v$ the sound
velocity, $f$ a frequency, $r$ a value determined by the interaural distance
$2r$, and $\theta$ a direction of a sound source.

The relationship between a phase difference $\Delta\varphi$ and a frequency $f$ of
acoustic signals, which come from a direction of a sound source, is obtained
with the equation (3) and shown in FIG. 6.
(Belief factor calculator 17) 

The belief factor calculator 17 calculates a belief factor for IPD and IID, 
respectively. 

Description is first given of the "IPD belief factor". An IPD belief factor is
obtained as a function of $\theta$ so as to indicate which direction a harmonic
component $f_k$ is likely to come from, the component being included in a
harmonic relationship (harmonic relationship CR41 or CL41, for example)
corresponding to a speaker HMj. The IPD is fitted into a probability function.

First, a hypothetical IPD (estimated value) for $f_k$ is calculated with an
equation (4).

$$\Delta\varphi_h(\theta, f_k) = \frac{2\pi f_k}{v} \times r(\theta + \sin\theta) \tag{4}$$

$\Delta\varphi_h(\theta, f_k)$ represents a hypothetical IPD (estimated value) with respect
to a sound source lying in a direction $\theta$ for a $k$th harmonic component $f_k$.
Thirty-seven hypothetical IPD's are, for example, calculated while a direction
$\theta$ of a sound source is varied over a range of ±90° at intervals of 5°. It may
be alternatively possible to calculate at finer or coarser angle intervals.
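
The generation of the thirty-seven hypotheses can be sketched as follows, reading the hypothetical IPD as $(2\pi f_k / v) \times r(\theta + \sin\theta)$ with $\theta$ in radians. The sound velocity, the value of $r$, the frequency and all names are assumptions chosen here for illustration.

```python
import math

SOUND_VELOCITY = 340.0  # m/s, assumed value for v
HEAD_RADIUS = 0.09      # m, assumed value for r


def hypothetical_ipd(theta_deg, freq):
    """Hypothetical IPD for a source in direction theta (degrees) at frequency freq (Hz)."""
    theta = math.radians(theta_deg)
    return 2.0 * math.pi * freq / SOUND_VELOCITY * HEAD_RADIUS * (theta + math.sin(theta))


# Directions from -90 to 90 degrees at intervals of 5 degrees.
directions = list(range(-90, 91, 5))
hypotheses = {theta: hypothetical_ipd(theta, 500.0) for theta in directions}
print(len(hypotheses))
```

In use, one such table would be built for each peak frequency of a harmonic relationship, so that every direction has a hypothetical IPD for every harmonic.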

Next, a difference between $\Delta\varphi_h(\theta, f_k)$ and $\Delta\varphi(f_k)$ is calculated with an
equation (5) and a summation is obtained for all the peak frequencies $f_k$. This
difference $d(\theta)$, which represents a distance between a hypothesis and an input,
tends to take a smaller value if $\theta$ lies closer to a direction of a speaker but a
larger value if $\theta$ lies farther from the direction of the speaker.

A belief factor $B_{IPD}(\theta)$ is obtained by entering the resulting $d(\theta)$ in a
probability function, the following equation (6).

$$B_{IPD}(\theta) = \int_{X(\theta)}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx \tag{6}$$

where $X(\theta) = (d(\theta) - m)/\sqrt{s/n}$, $m$ is a mean of $d(\theta)$, $s$ is a variance of $d(\theta)$
and $n$ is a number of hypothetical IPD's (37 in this embodiment).
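
The conversion of distances into belief factors can be sketched as follows. Two assumptions are made here and should be read as such: the distance is taken as the mean squared difference between hypothetical and measured IPDs over the peaks, which is only a stand-in for equation (5), and the probability function is taken as the upper-tail probability of a standard normal distribution evaluated at $(d(\theta) - m)/\sqrt{s/n}$, so that a smaller distance gives a larger belief. All names and sample values are hypothetical.

```python
import math


def distance(hyp_ipds, measured_ipds):
    """Assumed stand-in for equation (5): mean squared difference over the peaks."""
    k = len(measured_ipds)
    return sum((h - m) ** 2 for h, m in zip(hyp_ipds, measured_ipds)) / k


def ipd_belief(distances):
    """Map each distance d(theta) to a belief via the upper tail of a standard normal."""
    n = len(distances)
    mean = sum(distances.values()) / n
    variance = sum((d - mean) ** 2 for d in distances.values()) / n
    scale = math.sqrt(variance / n)
    return {
        theta: 0.5 * math.erfc((d - mean) / scale / math.sqrt(2.0))
        for theta, d in distances.items()
    }


# Hypothetical measured IPDs for two peaks and hypotheses for three directions.
measured = [0.2, 0.4]
hypotheses = {-30: [-0.3, -0.6], 0: [0.0, 0.0], 30: [0.25, 0.45]}
distances = {theta: distance(h, measured) for theta, h in hypotheses.items()}
beliefs = ipd_belief(distances)
print(max(beliefs, key=beliefs.get))
```

If all distances were equal the variance would be zero and the normalization would fail, so a practical version would need to guard that case.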

Description is next given of the "IID belief factor". An IID belief factor is
obtained in the following manner. A summation of intensity differences included
in a harmonic relationship corresponding to a speaker HMj is calculated with an
equation (7).

$$S = \sum_{k=0}^{K-1} \Delta\rho(f_k) \tag{7}$$

where $K$ represents the number of harmonics included in a harmonic relationship
and $\Delta\rho(f_k)$ is an IID calculated by the IID calculator 15.

Using Table 1, the likelihood that a sound direction lies to the right, center
or left is transformed into a belief factor. In this connection, Table 1 shows
empirical values.

When a hypothetical sound direction $\theta$ is equal to 40° and an intensity
difference $S$ has a positive sign, for example, a belief factor $B_{IID}(\theta)$ is regarded
as 0.35 according to the upper-left box of Table 1.



Table 1

                        $\theta$
             90° ~ 30°    30° ~ -30°    -30° ~ -90°
$S$    +     0.35         0.5           0.65
       -     0.65         0.5           0.35
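
The lookup in Table 1 can be sketched as follows. The sketch reads the first row of values as the row for a positive $S$ and the second row as the row for a negative $S$. The assignment of the boundary angles 30° and -30° to the center column and the treatment of $S = 0$ as non-positive are assumptions made here, since the table does not settle them; the function name is hypothetical.

```python
def iid_belief(theta_deg, intensity_sum):
    """Look up the IID belief factor for a direction and the sign of the summed IID."""
    if theta_deg > 30:
        column = 0   # 90 to 30 degrees
    elif theta_deg >= -30:
        column = 1   # 30 to -30 degrees (boundaries placed here by assumption)
    else:
        column = 2   # -30 to -90 degrees
    row = (0.35, 0.5, 0.65) if intensity_sum > 0 else (0.65, 0.5, 0.35)
    return row[column]


# The example from the text: a direction of 40 degrees and a positive S.
print(iid_belief(40, 1.2))
```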



(Belief factor integrator 18) 

The belief factor integrator 18 integrates an IPD belief factor $B_{IPD}(\theta)$
and an IID belief factor $B_{IID}(\theta)$ based on Dempster-Shafer theory with an
equation (8), calculating an integrated belief factor $B_{IPD+IID}(\theta)$. A $\theta$ which
provides a largest $B_{IPD+IID}(\theta)$ is considered to coincide with a direction of a
speaker HMj, so that it is denoted as $\theta_{HMj}$ in the description below.

$$B_{IPD+IID}(\theta) = 1 - (1 - B_{IPD}(\theta))(1 - B_{IID}(\theta)) \tag{8}$$
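
The integration and the selection of the most probable direction can be sketched as follows, combining the two belief factors as one minus the product of their complements. The function names and the belief values for the three directions are hypothetical.

```python
def integrate_beliefs(b_ipd, b_iid):
    """Combine the two belief factors as 1 - (1 - B_IPD)(1 - B_IID)."""
    return 1.0 - (1.0 - b_ipd) * (1.0 - b_iid)


def most_probable_direction(ipd_beliefs, iid_beliefs):
    """Return the direction with the largest integrated belief and all integrated beliefs."""
    combined = {
        theta: integrate_beliefs(ipd_beliefs[theta], iid_beliefs[theta])
        for theta in ipd_beliefs
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical belief factors for three candidate directions (degrees).
ipd_beliefs = {-30: 0.10, 0: 0.40, 30: 0.90}
iid_beliefs = {-30: 0.50, 0: 0.50, 30: 0.35}
best, combined = most_probable_direction(ipd_beliefs, iid_beliefs)
print(best, round(combined[best], 3))
```

The integrated belief is never smaller than either input, so a direction supported strongly by one cue remains a strong candidate even when the other cue is uninformative.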

It may be alternatively possible to use a hypothesis by head related 
transfer function or a hypothesis by scattering theory instead of the hypothesis 
by auditory epipolar geometry. 
(Hypothesis by head related transfer function) 

A hypothesis by head related transfer function is a phase difference and 
an intensity difference for sounds detected by microphones Mr and Ml, which 
are obtained from impulses generated in a surrounding environment of a 
robot. 

The hypothesis by head related transfer function is obtained in the
following manner. The microphones Mr and Ml detect impulses, which are
sent at appropriate intervals (5°, for example) over a range of -90° to 90°. A
frequency analysis is conducted for each impulse so as to obtain a phase
response and a magnitude response with respect to frequencies $f$. A
difference between phase responses and a difference between magnitude
responses are calculated to provide a hypothesis by head related transfer
function.

The hypothesis by head related transfer function, which is calculated as 
described above, results in IPD shown in FIG.7A and IID shown in FIG.7B. 

When a head related transfer function is introduced, it is possible to
obtain a relationship between IID and a frequency of a sound coming from a
certain direction in addition to IPD. Therefore, a belief factor is calculated
based on distance data $d(\theta)$, which has been generated for both IPD and IID.
The method for generating a hypothesis is the same for IPD and IID.

Different from the method for generating a hypothesis with auditory
epipolar geometry, a hypothesis by head related transfer function establishes a
relationship between frequency $f$ and IPD for a signal, which is generated in
each sound direction, by means of measurement in lieu of calculation. A $d(\theta)$,
which is a distance between a hypothesis and an input, is directly calculated
from actual measurement values shown in FIGS. 7A and 7B, respectively.
(Hypothesis by scattering theory) 

Scattering theory estimates both IPD and IID, taking into account
waves scattered by an object which scatters sounds, such as a head of a
robot. It is assumed here that a head of a robot is an object which has a
main effect on the input of a microphone and that the head is a sphere having a
radius $a$. It is also assumed that coordinates representative of the center of
the head are the origin of a polar coordinate system.

When $r_0$ is the position of a point sound source and $r$ is an observation
point, the potential $V^i$ due to the direct sound at the observation point is
defined by equation (9).

$$V^i = \frac{v}{2\pi R f}\, e^{\,i\frac{2\pi R f}{v}} \tag{9}$$

where:

$f$ : frequency of the point sound source
$v$ : sound velocity
$R$ : distance between the point sound source and the observation point

As shown in "J. J. Bowman, T.B.A. Senior, and P.L.E. Uslenghi:
Electromagnetic and Acoustic Scattering by Simple Shapes, Hemisphere
Publishing Co., 1987" and the like, the potential due to the direct and
scattered sounds is defined by equation (10) when the observation point $r$
lies on the surface of the head.

$$S(\theta, f) = V^i + V^s = -\left(\frac{v}{2\pi a f}\right)^{2} \sum_{n=0}^{\infty} (2n+1)\, P_n(\cos\theta)\, \frac{h_n^{(1)}\!\left(\frac{2\pi r_0}{v} f\right)}{h_n^{(1)\prime}\!\left(\frac{2\pi a}{v} f\right)} \tag{10}$$

where:

$V^s$ : potential due to the scattered sound
$P_n$ : Legendre function of the first kind
$h_n^{(1)}$ : Hankel function of the first kind

When the polar coordinates of Mr and Ml are $(a, \pi/2, 0)$ and
$(a, -\pi/2, 0)$, respectively, the potentials at these microphones are
represented by equations (11) and (12), respectively.

$$S_L(\theta, f) = S\!\left(\frac{\pi}{2} - \theta, f\right) \tag{11}$$

$$S_R(\theta, f) = S\!\left(-\frac{\pi}{2} - \theta, f\right) \tag{12}$$

In this way, a phase difference (IPD) $\Delta\varphi_s(\theta, f)$ and an
intensity difference (IID) $\Delta\rho_s(\theta, f)$ are calculated by the
following equations (13) and (14), respectively.

$$\Delta\varphi_s(\theta, f) = \arg(S_L(\theta, f)) - \arg(S_R(\theta, f)) \tag{13}$$

$$\Delta\rho_s(\theta, f) = 20 \log_{10} \frac{|S_L(\theta, f)|}{|S_R(\theta, f)|} \tag{14}$$
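
The chain from the rigid-sphere potential to the interaural differences can be
sketched numerically as follows. This is a minimal sketch under stated
assumptions, not the implementation of this embodiment: the function names,
the head radius of 0.1 m, the source distance of 1 m, the sound velocity of
340 m/s, the truncation of the series at 40 terms and the use of upward
recurrences for the spherical Hankel function are all choices made for the
example. The series is the standard rigid-sphere, point-source solution with
wavenumber $k = 2\pi f / v$.

```python
import cmath
import math


def spherical_hankel1(n_max, x):
    """Return ([h_0..h_n_max], [h'_0..h'_n_max]) of the spherical Hankel
    function of the first kind at real x > 0, using upward recurrence."""
    e = cmath.exp(1j * x)
    h = [-1j * e / x, -e * (x + 1j) / (x * x)]
    for n in range(1, n_max):
        h.append((2 * n + 1) / x * h[n] - h[n - 1])
    # h_0' = -h_1 and h_n' = h_{n-1} - (n + 1) / x * h_n for n >= 1.
    dh = [-h[1]] + [h[n - 1] - (n + 1) / x * h[n] for n in range(1, n_max + 1)]
    return h[: n_max + 1], dh


def legendre(n_max, c):
    """Return [P_0(c) .. P_n_max(c)] by the three-term recurrence."""
    p = [1.0, c]
    for n in range(1, n_max):
        p.append(((2 * n + 1) * c * p[n] - n * p[n - 1]) / (n + 1))
    return p[: n_max + 1]


def potential(angle, f, a=0.1, r0=1.0, v=340.0, n_max=40):
    """Truncated series for the potential on the sphere surface.

    `angle` (radians) is measured between the source and the observation
    point; a, r0, v and n_max are assumed example values.
    """
    k = 2.0 * math.pi * f / v
    h_src, _ = spherical_hankel1(n_max, k * r0)
    _, dh_head = spherical_hankel1(n_max, k * a)
    p = legendre(n_max, math.cos(angle))
    total = sum((2 * n + 1) * p[n] * h_src[n] / dh_head[n] for n in range(n_max + 1))
    return -total / (k * a) ** 2


def ipd_iid(theta, f):
    """Interaural phase (rad, not unwrapped) and intensity (dB) differences."""
    s_l = potential(math.pi / 2 - theta, f)   # left microphone
    s_r = potential(-math.pi / 2 - theta, f)  # right microphone
    ipd = cmath.phase(s_l) - cmath.phase(s_r)
    iid = 20.0 * math.log10(abs(s_l) / abs(s_r))
    return ipd, iid
```

For a source straight ahead (`theta = 0`) both differences vanish by symmetry,
which makes a convenient sanity check for any implementation of this model.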

Replacing $\Delta\varphi_h(\theta, f_k)$ of equation (4) with the IPD
$\Delta\varphi_s(\theta, f_k)$, a belief factor $B_{IPD}(\theta)$ is calculated
in the same process as that for auditory epipolar geometry.

Namely, the difference between $\Delta\varphi_s(\theta, f_k)$ and
$\Delta\varphi(f_k)$ is calculated and the sum $d(\theta)$ over all peaks $f_k$
is then calculated, which is incorporated into the probability density
function shown in equation (6) so as to obtain the belief factor
$B_{IPD}(\theta)$.

As for IID, $d(\theta)$ and $B_{IID}(\theta)$ are calculated by a method
similar to that applied to IPD. More specifically, in addition to replacing
$\Delta\varphi$ with $\Delta\rho$, $\Delta\varphi_h(\theta, f_k)$ in equation
(4) is replaced with the IID $\Delta\rho_s(\theta, f_k)$ of equation (14).
Then, the difference between $\Delta\rho_s(\theta, f_k)$ and $\Delta\rho(f_k)$
is calculated and the sum $d(\theta)$ over all peaks $f_k$ is then calculated,
which is incorporated into the probability density function shown in equation
(6) so as to obtain the belief factor $B_{IID}(\theta)$.

If a sound direction is estimated based on the scattering theory, it is
possible to generate a model representing the relationship between a sound
direction and a phase difference as well as between a sound direction and an
intensity difference, taking into account sound scattered along the surface of
the head of the robot, for example the effect of a sound traveling round the
rear side of the head. This leads to an increase in accuracy for estimation of
a sound direction. When a sound source lies sideways with respect to the head,
introducing the scattering theory is particularly effective in increasing the
accuracy, because the power of the sound that reaches the microphone on the
side opposite the sound source is relatively great.
(Sound source separation module 20) 

The sound source separation module 20 separates acoustic (speech)
signals for a speaker HMj according to information on a localized sound
direction and a spectrum (spectrum CR2, for example) provided by the sound
source localization module 10. Although conventional methods such as beam
forming, null forming, peak tracking, a directional microphone and
Independent Component Analysis (ICA) are applicable to separation of a sound
source, description here is given of a method with an active direction-pass
filter developed by the inventors of the present invention.

The farther a sound direction lies from the front of a robot RB, the less
accurate the information on the sound direction estimated through two
microphones tends to be, which makes separation of a sound source more
difficult. In order to solve this problem, this embodiment employs active
control so that the pass range is narrower for a sound source lying in the
front direction but wider for a sound source lying remote from the front
direction, thereby increasing accuracy for separating a sound source.

More specifically speaking, the sound source separation module 20 
includes a pass range function 21 and a subband selector 22, as shown in 
FIG.8. 

(Pass range function 21) 

As shown in FIG.9, the pass range function 21 is a function relating a
sound direction to a pass range, which is adjusted in advance so that the pass
range becomes greater as the sound direction lies farther from the front. The
reason for this is that the information on a sound direction becomes less
accurate as the direction lies farther from the front (0°).
(Subband selector 22) 

The subband selector 22 selects sub-bands, which are estimated to come
from a particular direction, out of the respective frequency components
(called "sub-bands") of each of the spectrums CR2 and CL2. As shown in FIG. 10,
the subband selector 22 calculates the IPD $\Delta\varphi(f_i)$ and the IID
$\Delta\rho(f_i)$ (see an interaural phase difference C52 and an interaural
intensity difference C62 in FIG. 10) for the sub-bands of a spectrum according
to equations (1) and (2), based on the right and left spectrums CR2 and CL2,
which are generated by the sound source localization module 10.

Taking $\theta_{HMj}$, which is obtained by the sound source localization
module 10, as the sound direction to be extracted, the subband selector 22
refers to the pass range function 21 so as to obtain a pass range
$\delta(\theta_{HMj})$ corresponding to $\theta_{HMj}$. The subband selector
22 calculates a maximum $\theta_h$ and a minimum $\theta_l$ from the obtained
pass range $\delta(\theta_{HMj})$ with the following equation (15).

$$\theta_l = \theta_{HMj} - \delta(\theta_{HMj})$$
$$\theta_h = \theta_{HMj} + \delta(\theta_{HMj}) \tag{15}$$

A pass range B is shown in FIG. 11 in the form of a plan view, for
example.

Next, IPD and IID corresponding to $\theta_l$ and $\theta_h$ are estimated.
This estimation is carried out with a transfer function, which is prepared in
advance by measurement or calculation. The transfer function correlates a
frequency with IPD and with IID for a signal coming from a sound direction
$\theta$. As described above, epipolar geometry, a head related transfer
function or scattering theory is applied to the transfer function. The
estimated IPD is shown, for example, in FIG. 10 as $\Delta\varphi_l(f)$ and
$\Delta\varphi_h(f)$ in an interaural phase difference C53, and the estimated
IID is shown, for example, in FIG. 10 as $\Delta\rho_l(f)$ and
$\Delta\rho_h(f)$ in an interaural intensity difference C63.

Utilizing a transfer function of a robot RB, the subband selector 22
selects sub-bands for a sound direction $\theta_{HMj}$ according to the
frequency $f_i$ of the spectrum CR2 or CL2. The subband selector 22 selects a
sub-band based on IPD if the frequency $f_i$ is lower than a threshold
frequency $f_{th}$, or based on IID if the frequency $f_i$ is higher than the
threshold frequency $f_{th}$. The subband selector 22 selects a sub-band which
satisfies conditional equation (16).

$$f_i < f_{th}: \quad \Delta\varphi_l(f_i) < \Delta\varphi(f_i) < \Delta\varphi_h(f_i)$$
$$f_i > f_{th}: \quad \Delta\rho_l(f_i) < \Delta\rho(f_i) < \Delta\rho_h(f_i) \tag{16}$$

where $f_{th}$ represents a threshold frequency, based on which either IPD or
IID is selected as the criterion for filtering.

According to this conditional equation, a sub-band of frequency $f_i$ (an
area with diagonal lines), in which IPD lies between $\Delta\varphi_l(f)$ and
$\Delta\varphi_h(f)$, is selected for frequencies lower than the threshold
frequency $f_{th}$ in the interaural phase difference C53 shown in FIG. 10. In
contrast, a sub-band (an area with diagonal lines), in which IID lies between
$\Delta\rho_l(f)$ and $\Delta\rho_h(f)$, is selected for frequencies higher
than the threshold frequency $f_{th}$ in the interaural intensity difference
C63 shown in FIG. 10. A spectrum containing the sub-bands selected in this way
is referred to as an "extracted spectrum" in this specification.
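
The selection procedure described above (pass range, limits on direction,
limits on IPD and IID, and the frequency-dependent test) can be sketched as
follows. This is a minimal sketch, not the implementation of this embodiment:
the linear `pass_range` shape with its 20° and 30° end points, the threshold
of 1500 Hz and the two toy transfer functions `toy_ipd` and `toy_iid` are
hypothetical stand-ins for the pass range function of FIG.9 and for a transfer
function obtained by epipolar geometry, measurement or scattering theory.

```python
import math


def pass_range(theta_deg, front=20.0, side=30.0):
    """Hypothetical pass range: widens linearly from `front` degrees at 0
    degrees to `side` degrees at +/-90 degrees."""
    return front + (side - front) * min(abs(theta_deg), 90.0) / 90.0


def select_subbands(freqs, ipd, iid, theta_deg, ipd_model, iid_model, f_th=1500.0):
    """Return indices of sub-bands attributed to direction `theta_deg`.

    ipd_model(theta_deg, f) and iid_model(theta_deg, f) play the role of the
    transfer function; f_th is an assumed threshold frequency.
    """
    delta = pass_range(theta_deg)
    theta_l, theta_h = theta_deg - delta, theta_deg + delta
    selected = []
    for i, f in enumerate(freqs):
        if f < f_th:   # low frequencies: decide on IPD
            lo, hi, value = ipd_model(theta_l, f), ipd_model(theta_h, f), ipd[i]
        else:          # remaining frequencies: decide on IID
            lo, hi, value = iid_model(theta_l, f), iid_model(theta_h, f), iid[i]
        # min/max keeps the test valid even if the model is not monotonic.
        if min(lo, hi) < value < max(lo, hi):
            selected.append(i)
    return selected


def toy_ipd(theta_deg, f):
    """Hypothetical IPD model: delay between two microphones 0.18 m apart."""
    return 2.0 * math.pi * f * (0.18 / 340.0) * math.sin(math.radians(theta_deg))


def toy_iid(theta_deg, f):
    """Hypothetical IID model in dB, independent of frequency."""
    return 6.0 * math.sin(math.radians(theta_deg))


picked = select_subbands(
    freqs=[500.0, 1000.0, 3000.0],
    ipd=[0.1, 2.0, 0.0],
    iid=[0.0, 0.0, 1.0],
    theta_deg=0.0,
    ipd_model=toy_ipd,
    iid_model=toy_iid,
)
```

In this toy input the 1000 Hz sub-band is rejected because its IPD falls
outside the limits for the front direction, while the other two are kept.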

There is an alternative method, which introduces a directional
microphone for separating a sound source, instead of the sound source
separation module 20 according to this embodiment described above. More
specifically, a microphone with narrow directivity is installed on a robot RB.
If the face of the robot is controlled so that the directional microphone is
turned to the sound direction $\theta_{HMj}$ acquired by the sound source
localization module 10, it is possible to collect only speech coming from
this direction.

If there is only a single directional microphone, a problem may arise
that collection of speech is limited to a single person. However, it may be
possible to collect speech of a plurality of people simultaneously if a
plurality of directional microphones is arranged at regular angular intervals
so that the speech signals sent by the directional microphone arranged for
each sound direction can be used selectively.
(Feature extractor 30) 

The feature extractor 30 extracts features necessary for speech
recognition from a speech spectrum, which is separated by the sound source
separation module 20, or from an unseparated spectrum CR2 (or CL2). Each of
these spectrums is referred to as a "spectrum for recognition" when it is used
for speech recognition. It is possible to use, as features of speech, a linear
spectrum, a Mel frequency spectrum or Mel-Frequency Cepstrum Coefficients
(MFCCs), which result from frequency analysis. In this embodiment,
description is given of an example with MFCCs. In this connection, when a
linear spectrum is adopted, the feature extractor 30 does not carry out any
process. In the case of a Mel frequency spectrum, the cosine transformation
(to be described later) is not carried out.

As shown in FIG. 12 A, the feature extractor 30 includes a log spectrum 
converter 31, a Mel frequency converter 32 and a discrete cosine 
transformation (DCT) module 33. 

The log spectrum converter 31 converts the amplitude of a spectrum for
recognition, which is selected by the subband selector 22 (see FIG.8), into a
logarithm, providing a log spectrum.

The Mel frequency converter 32 passes the log spectrum generated by the
log spectrum converter 31 through Mel frequency bandpass filters, providing a
Mel frequency spectrum, whose frequency axis is converted to the Mel scale.

The DCT module 33 carries out a cosine transformation of the Mel
frequency spectrum generated by the Mel frequency converter 32. The
coefficients obtained by this cosine transformation are the MFCCs.
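
The three stages above can be sketched in a few lines of Python. This is a
minimal sketch, not the implementation of this embodiment: it follows the
order given in the text (logarithm, then Mel filtering, then cosine
transformation), and the triangular filter shape, the common Mel formula
$2595 \log_{10}(1 + f/700)$, the filter count, the coefficient count and the
function names are assumptions made for the example.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_from_amplitude(amplitude, sample_rate, n_filters=12, n_coeffs=8):
    """Convert an amplitude spectrum (bins from 0 Hz to Nyquist) to MFCCs."""
    n_bins = len(amplitude)
    bin_hz = [i * (sample_rate / 2.0) / (n_bins - 1) for i in range(n_bins)]

    # Stage 1 (log spectrum converter): amplitude -> logarithm.
    log_spec = [math.log(max(a, 1e-10)) for a in amplitude]

    # Stage 2 (Mel frequency converter): triangular filters equally spaced
    # on the Mel scale; each output is a weighted average of the log spectrum.
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * k / (n_filters + 1)) for k in range(n_filters + 2)]
    mel_spec = []
    for k in range(1, n_filters + 1):
        lo, mid, hi = edges[k - 1], edges[k], edges[k + 1]
        num = den = 0.0
        for f, value in zip(bin_hz, log_spec):
            if lo < f < hi:
                w = (f - lo) / (mid - lo) if f <= mid else (hi - f) / (hi - mid)
                num += w * value
                den += w
        mel_spec.append(num / den if den > 0.0 else 0.0)

    # Stage 3 (DCT module): type-II cosine transformation.
    return [
        sum(m * math.cos(math.pi * c * (j + 0.5) / n_filters)
            for j, m in enumerate(mel_spec))
        for c in range(n_coeffs)
    ]


# A flat spectrum has all its energy in coefficient 0.
coeffs = mfcc_from_amplitude([math.e] * 129, sample_rate=16000)
```

Stopping after stage 2 yields the Mel frequency spectrum mentioned above as an
alternative feature.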

It may be possible to add a masking module 34, which gives an index (0
to 1), within or after the feature extractor 30 as shown in FIG.12B so that a
spectrum sub-band is not regarded as having reliable features when the input
speech is deformed due to noise.

Description in detail is given of the example shown in FIG.12B. When the
feature extractor 30 includes a masking module 34, a dictionary 59 possesses a
time series spectrum corresponding to each word. Here, this time series
spectrum is referred to as a "word speech spectrum".

A word speech spectrum is acquired by a frequency analysis carried out
for speech resulting from a word uttered in a noise-free environment. When a
spectrum for recognition is entered into the feature extractor 30, the word
speech spectrum for a word, which is estimated to exist in the input speech,
is sorted out from the dictionary as an expected speech spectrum. The
criterion applied to the estimation here is that the word speech spectrum
having the time span closest to that of the spectrum for recognition is
regarded as the expected speech spectrum. Passing through the log spectrum
converter 31, the Mel frequency converter 32 and the DCT module 33, the
spectrum for recognition and the expected speech spectrum are each
transformed into MFCCs. In the following descriptions, the MFCCs of the
spectrum for recognition are referred to as "MFCCs for recognition" and the
MFCCs of the expected speech spectrum as "expected MFCCs".

The masking module 34 calculates the difference between the MFCCs for
recognition and the expected MFCCs, assigning zero to an MFCC if the
difference is greater than a threshold estimated beforehand but one if it is
smaller than the threshold. The masking module 34 sends the value as an index
$\omega$, in addition to the MFCCs for recognition, to a speech recognition
module 50.
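
The thresholding rule above amounts to a per-coefficient comparison. The
following is a minimal sketch under stated assumptions: the function name
`reliability_index`, the use of an absolute difference and the treatment of a
difference exactly equal to the threshold (indexed as reliable here) are not
specified in the text and are chosen only for illustration.

```python
def reliability_index(mfcc_recognition, mfcc_expected, threshold):
    """Return one index per MFCC component: 0 (unreliable) when the absolute
    difference from the expected MFCC exceeds `threshold`, otherwise 1."""
    return [
        0 if abs(x - e) > threshold else 1
        for x, e in zip(mfcc_recognition, mfcc_expected)
    ]


# The second component deviates strongly from the expectation and is masked.
omega = reliability_index([1.0, 2.0, 3.0], [1.1, 4.0, 3.0], threshold=0.5)
```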

It may be possible to sort out one or more expected speech spectrums. It
may be alternatively possible to adopt all word speech spectrums without
sorting out. In this case, the masking module 34 assigns indexes $\omega$ to
all expected speech spectrums, sending them to the speech recognition module
50.

When a directional microphone is used for sound source separation, an 
ordinary method of frequency analysis, such as an FFT and bandpass filter, is 
applied to a separated speech so as to obtain a spectrum. 
(Acoustic model composition module 40) 

The acoustic model composition module 40 composes an acoustic model 
adjusted to a localized sound direction based on direction-dependent acoustic 
models, which are stored in the acoustic model memory 49. 

As shown in FIG. 13, the acoustic model composition module 40, which
has an inverse discrete cosine transformation (IDCT) module 41, a linear
spectrum converter 42, an exponential converter 43, a parameter composition
module 44, a log spectrum converter 45, a Mel frequency converter 46 and a
discrete cosine transformation (DCT) module 47, composes an acoustic model
for a direction $\theta$ by referring to the direction-dependent acoustic
models $H(\theta_n)$, which are stored in the acoustic model memory 49.
(Acoustic model memory 49) 

Direction-dependent acoustic models $H(\theta_n)$, which are adjusted to
respective directions $\theta_n$ with respect to the front of a robot RB, are
stored in the acoustic model memory 49. A direction-dependent acoustic model
$H(\theta_n)$ is trained, by way of a Hidden Markov Model (HMM), on speech of
a person uttered from a particular direction $\theta_n$. As shown in FIG. 14,
a direction-dependent acoustic model $H(\theta_n)$ employs a phoneme as a unit
for recognition, storing a corresponding sub-model $h(m, \theta_n)$ for each
phoneme. In this connection, it may be possible that other units for
recognition such as monophone, PTM, biphone, triphone and the like are
adopted for generating a sub-model.

If there are seven direction-dependent acoustic models at regular
intervals of 30° over a range of -90° to 90° in terms of direction $\theta_n$
and each model is composed of 40 monophones, the number of sub-models
$h(m, \theta_n)$ is 7 × 40 = 280.

A sub-model $h(m, \theta_n)$ has parameters such as the number of states, a
probability density distribution for each state and state transition
probabilities. In this embodiment, the number of states for a phoneme is fixed
to three: front (state 1), middle (state 2) and rear (state 3). Although a
normal distribution is adopted in this embodiment, it may be alternatively
possible to select, for the probability density distribution, a mixture model
made of one or more other distributions in addition to a normal distribution.
In this way, the acoustic models stored in the acoustic model memory 49
according to this embodiment are trained in terms of a state transition
probability $P$ and the parameters of a normal distribution, namely a mean
$\mu$ and a standard deviation $\sigma$.

Description is given of steps for generating training data for a sub-model
$h(m, \theta_n)$.

Speech signals, which include particular phonemes, are applied to a
robot RB by a speaker (not shown) placed in the direction for which an
acoustic model is to be generated. The feature extractor 30 converts the
detected acoustic signals to MFCCs, which the speech recognition module 50, to
be described later, recognizes. In this way, a probability for the recognized
speech signal is obtained for each phoneme. The acoustic model undergoes
adaptive training, in which a teaching signal indicative of the particular
phoneme corresponding to the particular direction is given for the resulting
probability. The acoustic model undergoes further training with phonemes and
words of sufficient variety (different speakers, for example) to learn the
sub-models.

When speech for training is given, it may be possible to give another
speech as noise from a direction different from that for which generation of
an acoustic model is intended. In this case, the sound source separation
module 20 separates only the speech that lies in the direction intended for
generating an acoustic model, and then the feature extractor 30 converts the
speech to MFCCs. In addition, if an acoustic model is intended for unspecified
speakers, it may be possible for the acoustic model to be trained on their
voices. In contrast, if an acoustic model is intended for specified speakers
individually, it may be possible for the acoustic model to learn with each
speaker.

The IDCT module 41 to the exponential converter 43 restore an MFCC 
of probability density distribution to a linear spectrum. They carry out a 
reverse operation for a probability density distribution in contrast to the 
feature extractor 30. 
(IDCT module 41) 

The IDCT module 41 carries out an inverse discrete cosine transformation
of the MFCCs possessed by a direction-dependent acoustic model $H(\theta_n)$
stored in the acoustic model memory 49, generating a Mel frequency spectrum.
(Linear spectrum converter 42) 

The linear spectrum converter 42 converts frequencies of the Mel
frequency spectrum, which is generated by the IDCT module 41, to linear
frequencies, generating a log spectrum.
(Exponential converter 43) 

The exponential converter 43 carries out an exponential conversion of
the intensity of the log spectrum, which is generated by the linear spectrum
converter 42, so as to generate a linear spectrum. The linear spectrum is
obtained in the form of a probability density distribution with a mean $\mu$
and a standard deviation $\sigma$.
(Parameter composition module 44) 

As shown in FIG. 15, the parameter composition module 44 multiplies
each direction-dependent acoustic model $H(\theta_n)$ by a weight and sums the
resulting products, composing an acoustic model $H(\theta_{HMj})$ for a sound
direction $\theta_{HMj}$. The sub-models in a direction-dependent acoustic
model $H(\theta_n)$ are each converted to a probability density distribution
of a linear spectrum by the IDCT module 41, the linear spectrum converter 42
and the exponential converter 43, and have parameters such as means
$\mu_{1nm}$, $\mu_{2nm}$, $\mu_{3nm}$, standard deviations $\sigma_{1nm}$,
$\sigma_{2nm}$, $\sigma_{3nm}$ and state transition probabilities $P_{11nm}$,
$P_{12nm}$, $P_{22nm}$, $P_{23nm}$, $P_{33nm}$. The module 44 normalizes an
acoustic model for a sound direction $\theta_{HMj}$ by multiplying these
parameters by weights, which are obtained beforehand by training and stored in
the acoustic model memory 49. In other words, the module 44 composes an
acoustic model for a sound direction $\theta_{HMj}$ by taking a linear
summation of the direction-dependent acoustic models $H(\theta_n)$. In this
connection, it will be described later how a weight $W_{n\theta_{HMj}}$ is
introduced.

When the sub-models in $H(\theta_{HMj})$ are composed, a mean
$\mu_{1\theta_{HMj}m}$ of state 1 is calculated by equation (17).

$$\mu_{1\theta_{HMj}m} = \frac{1}{N}\sum_{n=1}^{N} W_{n\theta_{HMj}}\, \mu_{1nm} \tag{17}$$

Means $\mu_{2\theta_{HMj}m}$ and $\mu_{3\theta_{HMj}m}$ can be calculated
similarly.

For composition of a standard deviation $\sigma_{1\theta_{HMj}m}$ of state 1,
a covariance $\sigma^{2}_{1\theta_{HMj}m}$ is calculated by equation (18).

$$\sigma^{2}_{1\theta_{HMj}m} = \frac{1}{N}\sum_{n=1}^{N} W_{n\theta_{HMj}}\, \sigma^{2}_{1nm} \tag{18}$$

Standard deviations $\sigma_{2\theta_{HMj}m}$ and $\sigma_{3\theta_{HMj}m}$
can be obtained similarly. It is possible to calculate a probability density
distribution with the obtained $\mu$ and $\sigma$.

A composed state transition probability $P_{11\theta_{HMj}m}$ for state 1 is
calculated by equation (19).

$$P_{11\theta_{HMj}m} = \frac{1}{N}\sum_{n=1}^{N} W_{n\theta_{HMj}}\, P_{11nm} \tag{19}$$

State transition probabilities $P_{12\theta_{HMj}m}$, $P_{22\theta_{HMj}m}$,
$P_{23\theta_{HMj}m}$ and $P_{33\theta_{HMj}m}$ can be calculated similarly.
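
The composition of the mean, the variance and a state transition probability
shares one pattern: a weighted sum over the N direction-dependent models
divided by N. The following is a minimal sketch of that pattern for a single
state of a single phoneme, not the implementation of this embodiment; the
dictionary keys `mean`, `std` and `p_stay` and the function names are
hypothetical.

```python
import math


def compose(values, weights):
    """(1/N) * sum_n W_n * value_n, with N the number of direction models."""
    n = len(values)
    return sum(w * v for w, v in zip(weights, values)) / n


def compose_state(models, weights):
    """Compose one HMM state from the same state of each direction model.

    Each entry of `models` holds the linear-spectrum parameters of one
    direction-dependent sub-model (hypothetical layout).
    """
    mean = compose([m["mean"] for m in models], weights)
    # The variance, not the standard deviation, is what gets averaged.
    variance = compose([m["std"] ** 2 for m in models], weights)
    p_stay = compose([m["p_stay"] for m in models], weights)
    return {"mean": mean, "std": math.sqrt(variance), "p_stay": p_stay}


models = [
    {"mean": 2.0, "std": 1.0, "p_stay": 0.6},
    {"mean": 4.0, "std": 3.0, "p_stay": 0.8},
]
composed = compose_state(models, weights=[1.0, 1.0])
```

With the 1/N factor written this way, weights that sum to N give a weighted
average of the direction-dependent parameters; how the weights are actually
scaled is determined by the weight-setting procedure and is not assumed here.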

Next, the probability density distribution is reconverted to MFCCs by the
log spectrum converter 45 through the DCT module 47. Because the log spectrum
converter 45, the Mel frequency converter 46 and the DCT module 47 are similar
to the log spectrum converter 31, the Mel frequency converter 32 and the DCT
module 33, respectively, description in detail is not repeated.

When a probability density distribution is composed in the form of a
mixture normal distribution instead of a single normal distribution, a
probability density distribution $f_{1\theta_{HMj}m}(x)$ is calculated by
equation (20) instead of the calculation of the mean $\mu$ and standard
deviation $\sigma$ described above.

$$f_{1\theta_{HMj}m}(x) = \frac{1}{N}\sum_{n=1}^{N} W_{n\theta_{HMj}}\, f_{1nm}(x) \tag{20}$$

Probability density distributions $f_{2\theta_{HMj}m}(x)$ and
$f_{3\theta_{HMj}m}(x)$ can be calculated similarly.

The parameter composition module 44 stores the acoustic model composed
as described above in the acoustic model memory 49.

In this connection, the parameter composition module 44 carries out
such acoustic model composition in real time while the automatic speech
recognition system 1 is in operation.
(Setting of a weight $W_{n\theta_{HMj}}$)

A weight $W_{n\theta_{HMj}}$ is assigned to a direction-dependent acoustic
model $H(\theta_n)$ when an acoustic model for a sound direction
$\theta_{HMj}$ is composed. It may be possible to adopt a common weight
$W_{n\theta_{HMj}}$ for all sub-models $h(m, \theta_n)$ or an individual
weight $W_{mn\theta_{HMj}}$ for each sub-model $h(m, \theta_n)$. Basically, a
function $f(\theta)$, which defines a weight $W_{n\theta_0}$ for a sound
source lying in front of the robot RB, is prepared in advance. When an
acoustic model is composed for a sound direction $\theta_{HMj}$, a
corresponding function is obtained by shifting $f(\theta)$ along the
$\theta$-axis by $\theta_{HMj}$ ($\theta \to \theta - \theta_{HMj}$). A weight
$W_{n\theta_{HMj}}$ is determined by referring to the resulting function.

(Generation of a function $f(\theta)$)

a. Method of generating $f(\theta)$ empirically

When $f(\theta)$ is generated empirically, $f(\theta)$ is described by the
following equations with a constant "a", which is empirically obtained.

$$f(\theta) = a\theta + a \quad (\theta < 0;\ f(\theta) = 0 \text{ when } \theta = -90^\circ)$$
$$f(\theta) = -a\theta + a \quad (\theta \ge 0;\ f(\theta) = 0 \text{ when } \theta = 90^\circ)$$
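
The piecewise-linear weight function and its shift along the direction axis
can be sketched as follows. This is a minimal sketch under one loudly stated
assumption: for the boundary condition (zero weight at ±90°) to hold with a
single constant "a", the direction is expressed here as a fraction of 90°.
That normalization, the clamping to zero beyond ±90° from the center and the
function names are assumptions of the example, not statements of the text.

```python
def front_weight(theta_deg, a=1.0):
    """Triangular weight for a front sound source.

    theta_deg is divided by 90 degrees (assumed normalization) so that the
    weight is `a` at 0 degrees and falls to 0 at +/-90 degrees.
    """
    x = theta_deg / 90.0
    if x < -1.0 or x > 1.0:
        return 0.0  # assumed: no contribution outside the triangle
    return a * x + a if x < 0 else -a * x + a


def weights_for_direction(theta_hmj_deg, model_directions_deg, a=1.0):
    """Weights of the direction-dependent models for a source at theta_hmj,
    obtained by shifting the front function along the direction axis."""
    return [front_weight(t - theta_hmj_deg, a) for t in model_directions_deg]


directions = [-90, -60, -30, 0, 30, 60, 90]
weights = weights_for_direction(40, directions)
```

For a source at 40°, the models for 30° and 60° receive the largest weights
in this sketch, and the model for -90° receives none.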

Assuming the constant a = 1.0, $f(\theta)$ for a front sound source results in
FIG.16A. FIG.16B shows $f(\theta)$ shifted along the $\theta$-axis by
$\theta_{HMj}$.
b. Method of generating $f(\theta)$ by training

When $f(\theta)$ is generated by training, training is carried out in the
following manner, for example.

$W_{mn\theta_0}$ represents a weight applied to an arbitrary phoneme "m" for a
sound source lying in the front. A trial is conducted with an acoustic model
$H(\theta_0)$, which is composed with a weight $W_{mn\theta_0}$ appropriately
selected as an initial value, so that the acoustic model $H(\theta_0)$
recognizes a sequence of phonemes including the phoneme "m", [m m' m''] for
example. More specifically, this sequence of phonemes is given by a speaker
placed in the front, and the trial is carried out. Though it is possible to
select a single phoneme "m" as training data, a sequence of phonemes is
adopted here, because better results of training can be attained with a
sequence of phonemes, which is a train of plural phonemes.

FIG. 17 shows an example of results of recognition. In FIG. 17, the
result of recognition with the acoustic model $H(\theta_0)$, which is composed
with the initial value $W_{mn\theta_0}$, is shown in the first row, and the
results of recognition with the direction-dependent acoustic models
$H(\theta_n)$ are shown in the second and subsequent rows. For example, it is
shown that the recognition result with an acoustic model $H(\theta_{90})$ was
a sequence of phonemes [/x/ /y/ /z/] and the recognition result with the
acoustic model $H(\theta_0)$ was a sequence of phonemes [/x/ /y/ m''].

Looking at the first phoneme in FIG. 17 after the first trial, when the
corresponding phoneme is recognized for a direction within a range of
$\theta = \pm 90°$ relative to the front, the weight $W_{mn\theta_0}$ for the
model representative of that direction is increased by $\Delta d$. $\Delta d$
is set to 0.05, for example, which is empirically determined. In contrast,
when the corresponding phoneme is not recognized for a direction, the weight
$W_{mn\theta_0}$ for the model representative of that direction is decreased
by $\Delta d/(n-k)$. In this way, the weight for a direction-dependent model
having produced a correct answer is increased, but the weight for one without
a correct answer is decreased.

Since $H(\theta_n)$ and $H(\theta_{90})$ each have a correct answer in the
example shown in FIG. 17, the corresponding weights $W_{mn\theta_0}$ and
$W_{m90\theta_0}$ are increased by $\Delta d$, but the other weights are
decreased by $2\Delta d/(n-2)$.

On the other hand, when there is no direction $\theta_n$ in which a
phoneme coinciding with the first phoneme is recognized and there is a
dominant direction-dependent acoustic model $H(\theta_n)$ having a larger
weight relative to the other models, the weight is decreased by $\Delta d$ for
only this model $H(\theta_n)$ and the other weights are increased by
$k\Delta d/(n-k)$. Because the fact that every direction-dependent acoustic
model failed in recognition implies that the current distribution of weights
is inappropriate, the weight is reduced for the direction in which the current
weight is dominant.

Whether a weight is dominant or not is determined by checking whether
the weight is larger than a predetermined threshold (0.8 here, for example).
If there is no dominant direction-dependent acoustic model $H(\theta_n)$, only
the maximum weight is decreased by $\Delta d$ and the weights for the other
direction-dependent acoustic models $H(\theta_n)$ are increased by
$\Delta d/(n-1)$.

The trial described above is then repeated with the updated weights.
When the recognition by the composed acoustic model $H(\theta_0)$ results in
the correct answer "m", the repetition is stopped, and either recognition and
training move on to the next phoneme m' or training is stopped. When the
training is stopped, the weight $W_{mn\theta_0}$ obtained here will be
$f(\theta)$. When training moves on to the next phoneme m', a mean of the
weights $W_{mn\theta_0}$, which result from training of all the phonemes, will
be $f(\theta)$.

It may be alternatively possible to assign a weight $W_{mn\theta_{HMj}}$
corresponding to each sub-model $h(m, \theta_n)$ to $f(\theta)$ without taking
a mean.

When a given number of trials ($0.5/\Delta d$ times, for example) does not
allow the recognition result of the acoustic model $H(\theta_{HMj})$ to be a
correct answer, that is, when recognition of "m" is not successful, for
example, training moves on to the next phoneme m'. The weights are then set to
the same values as the distribution of weights for the phoneme (m', for
example) that was last recognized successfully.

It may be possible to prepare beforehand, for an appropriate
$\theta_{HMj}$, a common weight $W_{n\theta_{HMj}}$, which is used by all
sub-models $h(m, \theta_n)$ included in $H(\theta_n)$ (see Table 2), or a
weight $W_{mn\theta_{HMj}}$ corresponding to each sub-model $h(m, \theta_n)$
(see Table 3). In this connection, the subscripts 1 ... m ... M represent
phonemes and 1 ... n ... N represent directions in Table 2 and Table 3.

Table 3

The weights obtained by the training described above are stored in the
acoustic model memory 49.

(Speech recognition module 50) 

Using the acoustic model $H(\theta_{HMj})$ composed for a sound direction
$\theta_{HMj}$, the speech recognition module 50 recognizes features, which
are extracted from separated speech of a speaker HMj or from an input speech,
generating character information. Subsequently, the module 50 recognizes the
speech by referring to the dictionary 59 to provide results of recognition.
Since this method of speech recognition is based on an ordinary technique with
a Hidden Markov Model, description in detail is omitted.

When a masking module, which adds an index $\omega$ indicating a belief
factor to each sub-band of MFCC, is disposed inside or after the feature
extractor 30, the speech recognition module 50 carries out recognition after
applying a process shown by an equation (21) to a received feature.

X r ~^~ X n Q6) 

$$x_n(i) = x(i) \times \omega(i)$$

$x_r$ : feature to be used for speech recognition
$x$ : MFCC
$i$ : component of MFCC
$x_n$ : unreliable component of $x$

Using the obtained output probability and state transition probability, 
the module 50 performs recognition in the same manner as that of general 
Hidden Markov Model. 

Description is given of operation carried out by an automatic speech 
recognition system 1 configured as described above. 

As shown in FIG.1, speeches of a plurality of speakers HMj (see FIG.3)
enter the microphones Mr and Ml of a robot RB.

Sound directions of the acoustic signals detected by the microphones Mr and
Ml are localized by the sound source localization module 10. As described
above, the module 10 calculates a belief factor with a hypothesis by auditory
epipolar geometry after conducting frequency analysis, peak extraction,
extraction of harmonic relationship and calculation of IPD and IID.
Integrating IPD and IID, the module 10 subsequently regards the most probable
$\theta_{HMj}$ as the sound direction (see FIG.2).

Next, the sound source separation module 20 separates the sound
corresponding to the sound direction $\theta_{HMj}$. Sound separation is
carried out in the following manner. First, the module 20 obtains upper limits
$\Delta\varphi_h(f)$ and $\Delta\rho_h(f)$, and lower limits
$\Delta\varphi_l(f)$ and $\Delta\rho_l(f)$, of IPD and IID for the sound
direction $\theta_{HMj}$ with the pass range function. The module 20 selects
sub-bands (selected spectrum), which are estimated to belong to the spectrum
for the sound direction $\theta_{HMj}$, by applying equation (16) described
above to these upper limits and lower limits. Subsequently, the module 20
converts the spectrum of the selected sub-bands by inverse FFT, transforming
the spectrum into speech signals.

The feature extractor 30 converts the selected spectrum separated by the
sound source separation module 20 into MFCCs by the log spectrum converter 31,
the Mel frequency converter 32 and the DCT module 33.

On the other hand, the acoustic model composition module 40 composes
an acoustic model, which is considered appropriate for the sound direction
$\theta_{HMj}$, upon receiving the direction-dependent acoustic models
$H(\theta_n)$ stored in the acoustic model memory 49 and the sound direction
$\theta_{HMj}$ localized by the sound source localization module 10.

The acoustic model composition module 40, which has the IDCT module
41, the linear spectrum converter 42 and the exponential converter 43,
converts the direction-dependent acoustic models $H(\theta_n)$ into linear
spectrums. The parameter composition module 44 composes an acoustic model
$H(\theta_{HMj})$ for the sound direction $\theta_{HMj}$ by taking an inner
product of the direction-dependent acoustic models $H(\theta_n)$ and the
weights $W_{n\theta_{HMj}}$ for the sound direction $\theta_{HMj}$, which the
module 44 reads out from the acoustic model memory 49. The module 40, which
has the log spectrum converter 45, the Mel frequency converter 46 and the DCT
module 47, converts this acoustic model $H(\theta_{HMj})$ in the form of a
linear spectrum to an acoustic model $H(\theta_{HMj})$ in the form of MFCCs.

Next, the speech recognition module 50 carries out speech recognition
with a Hidden Markov Model, using the acoustic model $H(\theta_{HMj})$
composed by the acoustic model composition module 40.

Table 4 shows an example resulting from the method described above. 
Table 4

                                               Conventional method              This invention
Direction of acoustic model           -90°  -60°  -30°    0°   30°   60°   90°       40°
Recognition rate of isolated word      20%   20%   38%   42%   60%   59%   50%       78%


As shown in Table 4, direction-dependent acoustic models were
prepared for a range of -90° to 90° at regular intervals of 30°, and speech
recognition of isolated words uttered from a direction of 40° was carried out
with each acoustic model (conventional method). The best recognition rate was
60%, which was obtained by the direction-dependent acoustic model for a
direction of 30°. In contrast, recognition of isolated words with an acoustic
model for a direction of 40°, which was composed with the method according to
this embodiment, attained a high recognition rate of 78%. Because the
automatic speech recognition system 1 according to this embodiment is able to
compose an appropriate acoustic model each time speech is uttered from an
arbitrary direction, a high recognition rate can be realized. In addition,
because the system 1 is able to recognize speech uttered from an arbitrary
direction, it can implement speech recognition with a high recognition rate
while a sound source or a moving object (robot RB) is moving.

Because it is sufficient to prepare a small number of
direction-dependent acoustic models, at intervals of 60° or 30° in terms of
sound direction, for example, it may be possible to decrease the cost of
training the acoustic models.

Because it is sufficient to carry out speech recognition with a single
composed acoustic model, parallel processing for carrying out speech
recognition with acoustic models representative of plural directions is not
required, which may lead to a reduction in calculation cost. Therefore, the
automatic speech recognition system 1 according to this embodiment is
appropriate for real-time processing and embedded use.

The present invention is not limited to the first embodiment, which has
been described so far; alternatives such as the modified embodiments
described below may also be implemented.

[Second embodiment] 

A second embodiment of the present invention has a sound source
localization module 110, which localizes a sound direction with a peak of
correlation, instead of the sound source localization module 10 of the first
embodiment. Because the second embodiment is similar to the first
embodiment except for this difference, description is not repeated for the
other modules.

(Sound source localization module 110) 

As shown in FIG. 18, the sound source localization module 110 includes 
a frame segmentation module 111, a correlation calculator 112, a peak 
extractor 113 and a direction estimator 114. 
(Frame segmentation module 111) 

The frame segmentation module 111 segments acoustic signals, which
have entered right and left microphones Mr and Ml, so as to generate
segmental acoustic signals having a given time length, 100 msec for example.
The segmentation process is carried out at appropriate time intervals,
30 msec for example.
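
A minimal sketch of this segmentation, using the example values of 100 msec frames taken every 30 msec, is shown below. The function name `segment_frames` and the choice to discard a trailing partial frame are assumptions.

```python
import numpy as np

def segment_frames(signal, sample_rate, frame_ms=100, hop_ms=30):
    """Cut a signal into overlapping frames.

    frame_ms: frame length in msec (100 msec in the example above).
    hop_ms: interval between frame starts in msec (30 msec in the example).
    A trailing partial frame is discarded (an assumption of this sketch).
    """
    signal = np.asarray(signal)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```

The same function would be applied to the signals of both microphones Mr and Ml so that corresponding frames cover the same time span.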
(Correlation calculator 112)

The correlation calculator 112 calculates a correlation by an equation 
(22) for the acoustic signals of the right and left microphones Mr and Ml, 
which have been segmented by the frame segmentation module 111. 



where:

CC(T) : correlation between x_L(t) and x_R(t)
T : frame length
x_L(t) : input signal from the microphone L segmented by frame length T
x_R(t) : input signal from the microphone R segmented by frame length T
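
For illustration only, a standard discrete cross-correlation between one left frame and one right frame can be computed as sketched below. This is a generic formulation and is not claimed to reproduce equation (22); the function name `cross_correlation` is hypothetical.

```python
import numpy as np

def cross_correlation(x_left, x_right):
    """Cross-correlation of two frames over all integer lags.

    Returns (lags, cc). A peak at a positive lag k means that the left
    signal is delayed by k samples relative to the right signal.
    """
    x_left = np.asarray(x_left, dtype=float)
    x_right = np.asarray(x_right, dtype=float)
    cc = np.correlate(x_left, x_right, mode="full")
    lags = np.arange(-(len(x_right) - 1), len(x_left))
    return lags, cc
```

Dividing the lag of a peak by the sampling rate gives an arrival time difference in seconds.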
(Peak extractor 113) 

The peak extractor 113 extracts peaks from the resulting correlations.
When the number of sound sources is known in advance, peaks are selected in
order of peak height until their number equals the number of sound sources.
When the number of sound sources is not known, on the other hand, it may be
possible to extract all peaks exceeding a predetermined threshold, or a
predetermined number of peaks in order of peak height.
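
The selection rules above can be sketched as follows. The definition of a peak as an interior local maximum, and the names `extract_peaks`, `n_sources`, `threshold` and `max_peaks`, are assumptions made for this illustration.

```python
import numpy as np

def extract_peaks(cc, n_sources=None, threshold=None, max_peaks=None):
    """Return indices of correlation peaks, highest first.

    n_sources: if given, keep exactly the n_sources highest peaks.
    threshold: otherwise, keep only peaks higher than this value.
    max_peaks: otherwise, optionally cap the number of peaks kept.
    """
    cc = np.asarray(cc, dtype=float)
    # Interior local maxima are treated as peaks.
    idx = [i for i in range(1, len(cc) - 1)
           if cc[i] > cc[i - 1] and cc[i] >= cc[i + 1]]
    idx.sort(key=lambda i: cc[i], reverse=True)
    if n_sources is not None:
        return idx[:n_sources]
    if threshold is not None:
        idx = [i for i in idx if cc[i] > threshold]
    if max_peaks is not None:
        idx = idx[:max_peaks]
    return idx
```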
(Direction estimator 114) 

Receiving the obtained peaks, the direction estimator 114 calculates a
difference of distance "d" shown in FIG. 19 by multiplying an arrival time
difference D of the acoustic signals entering the right and left microphones
Mr and Ml by the sound velocity V. The direction estimator 114 then generates
a sound direction θ_HMj by the following equation.

θ_HMj = arcsin(d / 2r)
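
A short worked sketch of this calculation follows. It reads the equation as arcsin(d / 2r) and assumes that 2r denotes the distance between the two microphones in FIG. 19; the default sound velocity of 340 m/s, the clamping of the ratio and the function name `estimate_direction` are further assumptions.

```python
import math

def estimate_direction(arrival_time_diff, two_r, sound_velocity=340.0):
    """Sound direction in degrees from an arrival time difference.

    arrival_time_diff: D in seconds.
    two_r: the quantity 2r in metres (assumed microphone spacing).
    sound_velocity: V in m/s (340 m/s is an assumed typical value).
    """
    d = arrival_time_diff * sound_velocity       # difference of distance
    ratio = max(-1.0, min(1.0, d / two_r))       # guard against |d| > 2r
    return math.degrees(math.asin(ratio))
```

For example, with 2r = 0.2 m and an arrival time difference that corresponds to d = 0.1 m, the ratio is 0.5 and the estimated direction is 30°.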

The sound source localization module 110, which introduces the
correlation described above, is also able to estimate a sound direction
θ_HMj. It is possible to increase the recognition rate with an acoustic model
appropriate for the sound direction θ_HMj, which is composed by the acoustic
model composition module 40 described above.



[Third embodiment] 

A third embodiment has an additional function of performing speech
recognition while checking whether acoustic signals come from the same sound
source. Description is not repeated for modules which are similar to those
described in the first embodiment and which bear the same symbols.

As shown in FIG. 20, an automatic speech recognition system 100
according to the third embodiment has an additional module, a stream
tracking module 60, compared with the automatic speech recognition system 1
according to the first embodiment. Receiving a sound direction localized by
the sound source localization module 10, the stream tracking module 60
tracks a sound source to check whether acoustic signals continue to come
from the same sound source. If this is confirmed, the stream tracking module
60 sends the sound direction to the sound source separation module 20.

As shown in FIG. 21, the stream tracking module 60 has a sound
direction history memory 61, a predictor 62 and a comparator 63.

The sound direction history memory 61 stores, in association with one
another, a time and the direction and pitch (the fundamental frequency f_0
of the harmonic structure which the sound source possesses) of a sound
source at that time.

The predictor 62 reads out the sound direction history of the sound
source, which has been tracked so far, from the sound direction history
memory 61. Subsequently, the predictor 62 predicts, with a Kalman filter or
the like, a stream feature vector (θ_HMj, f_0), which is made of a sound
direction θ_HMj and a fundamental frequency f_0 at the current time t1, and
sends the stream feature vector (θ_HMj, f_0) to the comparator 63.

The comparator 63 receives from the sound source localization module
10 a sound direction θ_HMj of each speaker HMj and a fundamental frequency
f_0 of the sound source at the current time t1, which have been localized by
the sound source localization module 10. The comparator 63 compares the
predicted stream feature vector (θ_HMj, f_0), which is sent by the predictor
62, with the stream feature vector (θ_HMj, f_0) resulting from the sound
direction and the pitch which are localized by the sound source localization
module 10. If the resulting difference (distance) is less than a
predetermined threshold, the comparator 63 sends the sound direction θ_HMj
to the sound source separation module 20. The comparator 63 also stores the
stream feature vector (θ_HMj, f_0) in the sound direction history memory 61.

If the difference (distance) is more than the predetermined threshold,
the comparator 63 does not send the localized sound direction θ_HMj to the
sound source separation module 20, so that speech recognition is not carried
out. In this connection, it may be alternatively possible for the comparator
63 to send data, which indicates whether or not a sound source can be
tracked, to the sound source separation module 20 in addition to the sound
direction θ_HMj.

It may be alternatively possible to use only a sound direction θ_HMj
without a fundamental frequency f_0 in performing prediction.
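
The predict-and-compare behaviour of the stream tracking module 60 can be sketched as below. To keep the example short, the Kalman filter is replaced by a plain linear extrapolation of the last two history entries, which is a substitute technique and not the predictor described above. The distance measure, the weight `f0_weight` and the function names are assumptions.

```python
import math

def predict_feature(history, t_now):
    """Predict (direction, f0) at t_now by linear extrapolation.

    history: non-empty list of (time, direction_deg, f0_hz), oldest first,
    with strictly increasing times. Stand-in for the Kalman filter.
    """
    if len(history) == 1:
        _, direction, f0 = history[-1]
        return direction, f0
    (t0, d0, p0), (t1, d1, p1) = history[-2], history[-1]
    scale = (t_now - t1) / (t1 - t0)
    return d1 + (d1 - d0) * scale, p1 + (p1 - p0) * scale

def track(history, t_now, observed, threshold, f0_weight=0.1):
    """Return the observed direction if it matches the prediction, else None.

    observed: (direction_deg, f0_hz) from the localization module.
    f0_weight: assumed scaling that makes Hz comparable with degrees.
    """
    predicted = predict_feature(history, t_now)
    distance = math.hypot(observed[0] - predicted[0],
                          f0_weight * (observed[1] - predicted[1]))
    if distance < threshold:
        # Same stream: store the new feature vector in the history.
        history.append((t_now, observed[0], observed[1]))
        return observed[0]
    return None
```

Setting `f0_weight` to zero corresponds to the alternative of using only the sound direction for the comparison.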

In the automatic speech recognition system 100, a sound direction which
is localized by the sound source localization module 10 and a pitch enter the
stream tracking module 60 described above. In the stream tracking module 60,
the predictor 62 reads out a sound direction history stored in the sound
direction history memory 61, predicting a stream feature vector (θ_HMj, f_0)
at the current time t1. The comparator 63 compares the stream feature vector
(θ_HMj, f_0) which is predicted by the predictor 62 with the stream feature
vector (θ_HMj, f_0) resulting from the values which are sent by the sound
source localization module 10. If the difference (distance) is less than a
predetermined threshold, the comparator 63 sends the sound direction to the
sound source separation module 20.

The sound source separation module 20 separates sound sources based
on spectrum data, which is sent by the sound source localization module 10,
and sound direction θ_HMj data, which is sent by the stream tracking module
60, in the same manner as in the first embodiment. A feature extractor 30,
an acoustic model composition module 40 and a speech recognition module 50
carry out processes in the same manner as in the first embodiment.

Because the automatic speech recognition system 100 according to this
embodiment carries out speech recognition after checking whether a sound
source can be tracked, it is able to keep carrying out recognition of speech
uttered by the same sound source even if the sound source is moving, which
leads to a reduction in the probability of false recognition. The automatic
speech recognition system 100 is beneficial in a situation where there is a
plurality of moving sound sources whose paths cross each other.

In addition, the automatic speech recognition system 100, which not
only stores but also predicts sound directions, is able to decrease the
amount of processing if searching for a sound source is limited to a certain
area corresponding to a particular sound direction.

While the embodiments of the present invention have been described,
the present invention is not limited to these embodiments, but can be
implemented with various changes and modifications.

One example is an automatic speech recognition system 1 which
includes a camera, a well-known image recognition system and a speaker
identification module which recognizes the face of a speaker and identifies
the speaker by referring to its database. When the system 1 possesses
direction-dependent acoustic models for each speaker, it is possible to
compose an acoustic model appropriate for each speaker, which enables a
higher recognition rate. Alternatively, speeches of speakers registered in
advance may be converted into vectors by vector quantization (VQ). The
system 1 compares the registered speeches with a speech, in the form of a
vector, which the sound source separation module 20 separates, and outputs
the speaker having the smallest distance.
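
The VQ-based alternative can be illustrated with the following sketch. It assumes that a codebook of feature vectors has already been trained for each registered speaker (for example by clustering), and it uses the average Euclidean distance to the nearest code vector as the distance; the names `vq_distortion` and `identify_speaker` and the speaker labels are hypothetical.

```python
import numpy as np

def vq_distortion(features, codebook):
    """Average distance from each feature vector to its nearest code vector.

    features: (T, D) array of feature vectors of the separated speech.
    codebook: (K, D) array of code vectors of one registered speaker.
    """
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1).mean()

def identify_speaker(features, codebooks):
    """Return the registered speaker whose codebook gives the smallest distance.

    codebooks: dict mapping a speaker label to that speaker's codebook.
    """
    features = np.asarray(features, dtype=float)
    return min(
        codebooks,
        key=lambda name: vq_distortion(
            features, np.asarray(codebooks[name], dtype=float)),
    )
```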